CUDA: set compute parameters via command line arguments #910
This PR adds the ability to control the following CUDA inference parameters via command line arguments:
- Fusion. This can currently only be turned on or off at build time via `-DGGML_CUDA_FUSION`, but I find it highly annoying that I need to recompile each time I want to test the difference between fusion and no fusion. Fusion improves TG speed by quite a bit (especially when fully offloaded), so ideally it should be enabled. But there are issues reporting problems (Bug: RoPE Cache lobotomizes GLM on my setup #893, Bug: GPT-OSS 120b CUDA error: an illegal memory access was encountered #895), hence it is useful to enable or disable fusion via a command line argument.
- GPU offload threshold. This can currently only be set at build time via a `cmake` argument (`GGML_CUDA_MIN_BATCH_OFFLOAD`, see Better strategy for GPU offload #520). But in reality the threshold above which it is better to offload tensors stored in RAM to the GPU for matrix multiplications depends not only on the particular hardware (GPU vs CPU speed, speed of the PCI-E bus), but also on the model and quantization type. Hence, it is handy to be able to set this threshold on the command line.
- MMQ-ID threshold. This determines whether the `mmq_id` matrix multiplication kernels introduced in CUDA: muh faster prompt processing for MoE models and small u-batch sizes #728 get used, or whether MoE matrix multiplications happen with the pre-728 approach. The reason I have added this is that `mmq_id` leads to a non-negligible increase in perplexity, see the comments in #728.

To set these parameters on the command line after this PR has been merged, one uses a comma separated list of `key=value` pairs passed to a new `-cuda` argument, as sketched below. Not all parameters need to be present, only those that one wants to change from their default value; giving just one of them, or any subset, is perfectly fine.
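A minimal sketch of the syntax, using the `offload-batch-size` and `mmq-id-size` keys that appear later in this description; the binary name and model path are just placeholders:

```
# Change two of the thresholds, leave everything else at its default
./bin/llama-cli -m model.gguf -cuda offload-batch-size=16,mmq-id-size=0

# Setting a single key=value pair is also fine
./bin/llama-cli -m model.gguf -cuda mmq-id-size=0
```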
Fusion
This should not need further explanation. One just enables it (value=1) or disables it (value=0). The default is currently disabled because of #893 and #895. Depending on how the story around fusion evolves, I may later add finer-grained control over which fused operations to enable or disable.
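For instance (the exact `fusion` key name is an assumption here, as the description only shows the other two keys; the rest of the command line is illustrative):

```
# Opt back into fusion at run time instead of rebuilding with -DGGML_CUDA_FUSION
./bin/llama-cli -m model.gguf -cuda fusion=1

# Explicitly keep it disabled (the current default)
./bin/llama-cli -m model.gguf -cuda fusion=0
```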
GPU offload threshold
This is relevant for hybrid inference. The GPU offload threshold controls above what batch size tensors stored in RAM will get offloaded to the GPU for computations. Let's call it `min_batch_size`. It is set to `GGML_CUDA_MIN_BATCH_OFFLOAD`, which has a default value of 32 if not specified as a `cmake` definition when building. For regular matrix multiplications `min_batch_size` is the threshold. But for MoE tensors, `ik_llama.cpp` uses as threshold `min_batch_size` multiplied by the ratio of total to active experts. So, for instance, for GLM-4.5/5-MoE, which has 128 total and 8 active experts, MoE tensors stored in RAM will only get offloaded for batch sizes >= 32 * 128 / 8 = 512. The new parameter allows one to control that. As an example, I have run `llama-bench` to measure performance as a function of prompt length (i.e., u-batch size) when tensors always get offloaded, and then the same benchmark with `-cuda offload-batch-size=1000` to measure the performance with tensors never being offloaded. The resulting graph shows that for this particular model GPU offload becomes faster somewhere around a u-batch size of 150-160 tokens. Hence, it would be better to change the default via `-cuda offload-batch-size=10` (10 because Qwen3-30B-A3B has 128 total and 8 active experts, so per the above rule MoE GPU offload is done for batches greater than 128/8 * 10 = 160).
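A sketch of the kind of comparison described above. The effective-threshold arithmetic follows the rule given earlier; the model file, prompt lengths, and remaining `llama-bench` flags are only illustrative, since the author's exact command is not reproduced here:

```
# Effective MoE offload threshold = offload-batch-size * (total experts / active experts)
#   default:                      32 * 128 / 8 = 512 tokens
#   -cuda offload-batch-size=10:  10 * 128 / 8 = 160 tokens

# Offload for (almost) all batch sizes by choosing a very small threshold
# (effective MoE threshold here: 1 * 128 / 8 = 16 tokens)
./bin/llama-bench -m qwen3-30b-a3b.gguf -p 32,64,128,256,512,1024 -cuda offload-batch-size=1

# Never offload by choosing a threshold that is never reached
./bin/llama-bench -m qwen3-30b-a3b.gguf -p 32,64,128,256,512,1024 -cuda offload-batch-size=1000
```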
MMQ-ID threshold
For MoE matrix multiplications `ik_llama.cpp` uses the `mmq_id` approach added in PR #728 when the batch size is below a certain threshold. This magic threshold has been determined experimentally; for larger batch sizes the original `ik_llama.cpp` MoE implementation tends to be faster. One can now verify this by running, e.g., a benchmark with `-cuda mmq-id-size=0` so that the `mmq_id` approach is never used, and then the same command but with `-cuda mmq-id-size=1000` to always use `mmq_id`.

But apart from speed, @Nexesenex has expressed concern that `mmq_id` increases perplexity. If you are worried that this is significant, you can now disable it via `-cuda mmq-id-size=0`. It is worth noting that mainline `llama.cpp` suffers from the exact same PPL increase (see #728 (comment)), and there is no way to disable it.
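A sketch of how one might compare the two code paths for both speed and perplexity; everything other than the `-cuda` argument (binary names, model path, prompt lengths, test file) is illustrative:

```
# Speed: never use mmq_id ...
./bin/llama-bench -m model.gguf -p 32,64,128,256 -cuda mmq-id-size=0
# ... versus always use it for these batch sizes
./bin/llama-bench -m model.gguf -p 32,64,128,256 -cuda mmq-id-size=1000

# Perplexity: check whether the mmq_id PPL increase matters for your model
./bin/llama-perplexity -m model.gguf -f wiki.test.raw -cuda mmq-id-size=0
./bin/llama-perplexity -m model.gguf -f wiki.test.raw -cuda mmq-id-size=1000
```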